perf: profile baseline -- redirect from OP/blitter to GPU/DSP RISC#128
Merged
perf: profile baseline -- redirect from OP/blitter to GPU/DSP RISC#128
Conversation
Captured baseline FPS + sample-based hot-function breakdown across 6 ROMs (yarc, jagniccc, Iron Soldier 1/2, Doom, Skyhammer) on Apple Silicon. Result contradicts the assumptions in spikes #123 (OP) and #124 (blitter): yarc.j64 -> GPUExec ~74% of frame time Iron Soldier -> DSPExec ~67% of frame time Skyhammer fast -> GPU+DSP ~57% of frame time, blitter <2% Skyhammer acc -> blitter ~21%, but only on opt-in accurate OP -> 1% on Iron Soldier, doesn't crack top-15 elsewhere 68K -> 0.7-2.6% everywhere; JIT not worth the licensing The right next perf target is GPU/DSP RISC dynarec or cached IR (half of issue #122). Both share the same Tom RISC ISA, so a single dispatcher attacks both. docs/op-perf-profile.md captures the methodology, the ROM-by-ROM numbers, and the recommendation. Drive-by: quote $(BENCH_ROM) in the Makefile benchmark recipe so ROM paths with spaces / parens (every commercial ROM filename) work. Co-Authored-By: Claude Opus 4.7 <[email protected]>
Original profile only sampled headless boot/menu state, which doesn't
exercise the blitter much. Added --load-state to test_benchmark
(handles RetroArch RASTATE container by extracting the MEM chunk),
loaded the user's AvP state6 save (active gameplay), and re-profiled.
Result: the spike reports' hypothesis about the blitter was right --
it just didn't show up in boot profiles.
AvP gameplay, fast blitter:
DSP ~63% (same as before), blitter 5%, GPU 4%, OP 1%
AvP gameplay, accurate blitter:
BlitterMidsummer2+ADDARRAY+DATA ~34% (matches docs/spike #124),
DSP ~31%, GPU small
OP stays at ~1% even on gameplay -- closing #123 as wontfix-for-now is
still the honest call.
Updated recommendation: two genuine targets, suggested order:
1. Accurate-blitter SIMD widening (#124) -- smaller surface, lower
risk, immediate visible win for AvP-style accurate-blitter
slowdown. ADDARRAY (src/tom/blitter.c:2090-2094) is the obvious
first target.
2. RISC dynarec / cached IR (#122 RISC half) -- bigger lift, helps
every game on every blitter mode.
Co-Authored-By: Claude Opus 4.7 <[email protected]>
Author
Regression:
|
| ROM | Status | Details | Diff |
|---|---|---|---|
| jagniccc | ✅ PASS | 0 pixels differ | - |
| yarc | ✅ PASS | 0 pixels differ | - |
| jagniccc (determinism) | ✅ PASS | identical across runs | - |
| yarc (determinism) | ✅ PASS | identical across runs | - |
| jagniccc (frameskip) | ✅ PASS | skip=0 matches skip=3 | - |
| yarc (frameskip) | ✅ PASS | skip=0 matches skip=3 | - |
| jagniccc (save state) | ✅ PASS | round-trip matches | - |
| yarc (save state) | ✅ PASS | round-trip matches | - |
| jagniccc (rewind) | ✅ PASS | rewind matches | - |
| yarc (rewind) | ✅ PASS | rewind matches | - |
Platform: Darwin arm64
Updated by CI at 2026-05-02T02:39:26.395Z
Author
Regression:
|
| ROM | Status | Details | Diff |
|---|---|---|---|
| jagniccc | ✅ PASS | 0 pixels differ | - |
| yarc | ✅ PASS | 0 pixels differ | - |
| jagniccc (determinism) | ✅ PASS | identical across runs | - |
| yarc (determinism) | ✅ PASS | identical across runs | - |
| jagniccc (frameskip) | ✅ PASS | skip=0 matches skip=3 | - |
| yarc (frameskip) | ✅ PASS | skip=0 matches skip=3 | - |
| jagniccc (save state) | ✅ PASS | round-trip matches | - |
| yarc (save state) | ✅ PASS | round-trip matches | - |
| jagniccc (rewind) | ✅ PASS | rewind matches | - |
| yarc (rewind) | ✅ PASS | rewind matches | - |
Platform: Linux x86_64
Updated by CI at 2026-05-02T02:39:45.599Z
Author
Regression:
|
| ROM | Status | Details | Diff |
|---|---|---|---|
| jagniccc | ✅ PASS | 0 pixels differ | - |
| yarc | ✅ PASS | 0 pixels differ | - |
| jagniccc (determinism) | ✅ PASS | identical across runs | - |
| yarc (determinism) | ✅ PASS | identical across runs | - |
| jagniccc (frameskip) | ✅ PASS | skip=0 matches skip=3 | - |
| yarc (frameskip) | ✅ PASS | skip=0 matches skip=3 | - |
| jagniccc (save state) | ✅ PASS | round-trip matches | - |
| yarc (save state) | ✅ PASS | round-trip matches | - |
| jagniccc (rewind) | ✅ PASS | rewind matches | - |
| yarc (rewind) | ✅ PASS | rewind matches | - |
Platform: Linux aarch64
Updated by CI at 2026-05-02T02:39:56.368Z
Contributor
There was a problem hiding this comment.
Pull request overview
Establishes a performance baseline for Virtual Jaguar’s OP/blitter vs GPU/DSP execution, and updates the benchmark harness so commercial ROM paths work reliably (and benchmarks can load a save state to profile gameplay rather than boot/menu idle).
Changes:
- Add a new profiling write-up documenting measured hotspots and updated optimization priorities (
docs/op-perf-profile.md). - Extend the headless benchmark tool with
--load-statesupport (rawretro_serialize()payload or RetroArch RASTATE “MEM” chunk). - Quote
$(BENCH_ROM)inmake benchmarkso paths with spaces work.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
test/tools/test_benchmark.c |
Adds --load-state and RASTATE MEM-chunk extraction to allow profiling from in-game states. |
docs/op-perf-profile.md |
New documentation capturing the baseline/profile results and recommended perf focus. |
Makefile |
Quotes $(BENCH_ROM) in the benchmark target invocation to support ROM paths with spaces. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+383
to
+398
| uint8_t *st_buf; | ||
| const uint8_t *payload; | ||
| size_t payload_size; | ||
| size_t expected; | ||
| fseek(stf, 0, SEEK_END); | ||
| st_total = ftell(stf); | ||
| fseek(stf, 0, SEEK_SET); | ||
| st_buf = (uint8_t *)malloc(st_total); | ||
| if (fread(st_buf, 1, st_total, stf) != (size_t)st_total) | ||
| { | ||
| fprintf(stderr, "ERROR: short read on state file\n"); | ||
| free(st_buf); fclose(stf); return 1; | ||
| } | ||
| fclose(stf); | ||
| payload = st_buf; | ||
| payload_size = (size_t)st_total; |
Comment on lines
+405
to
+417
| while (p + 8 <= end) | ||
| { | ||
| uint32_t chunk_size = (uint32_t)p[4] | ((uint32_t)p[5] << 8) | ||
| | ((uint32_t)p[6] << 16) | ((uint32_t)p[7] << 24); | ||
| if (memcmp(p, "MEM ", 4) == 0) | ||
| { | ||
| payload = p + 8; | ||
| payload_size = chunk_size; | ||
| found = 1; | ||
| break; | ||
| } | ||
| p += 8 + chunk_size; | ||
| } |
Comment on lines
+375
to
+380
| FILE *stf = fopen(state_load_path, "rb"); | ||
| if (!stf) | ||
| { | ||
| fprintf(stderr, "ERROR: cannot open state file: %s\n", state_load_path); | ||
| return 1; | ||
| } |
Three concerns flagged by automated review on PR #128: * Validate fseek/ftell/malloc/fread when reading the state file (negative size, NULL malloc, short read) instead of trusting them. * Bounds-check RASTATE chunk_size against the buffer end before treating any bytes past the chunk header as payload, so a truncated/corrupt container can't push retro_unserialize past the allocation. * Route every state-load failure through a single cleanup label that closes the FILE, frees the buffer, and tears down retro_load_game / retro_init / dlopen before returning, matching the cleanup in the normal exit path. No functional change for valid state files. Smoke-tested with make benchmark (no --load-state) -- still reports FPS as before. Co-Authored-By: Claude Opus 4.7 <[email protected]>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Set up the OP perf branch per the plan, then profiled before writing any code. The profile data redirects the entire perf phase: OP isn't the bottleneck (1% on Iron Soldier, doesn't crack top-15 elsewhere); blitter isn't either except on the opt-in accurate path; GPU and DSP RISC interpretation dominate every ROM tested.
This PR is just the doc + the Makefile fix to make commercial ROM paths work in
make benchmark. The headline finding lives indocs/op-perf-profile.md.Profile (TL;DR)
GPUExecDSPExec+ opcodesGPUExec+DSPExecBlitterMidsummer2+ DSP + GPUm68k_execute0.7-2.6%What this changes
Methodology
make benchmark BENCH_ROM=<rom> BENCH_BLITTER={fast,accurate}— 600 frames + 60 warmup, headless, no presentation/usr/bin/sample <pid> 8 -file <out>over a 6000-frame benchmark runRELEASE_DEBUG_INFO=1Test plan
-g, no inlining lost the obvious targets)Branch base
develop(per GitFlow)